Spain is not different: teaching quantitative courses can also be hazardous to one’s career (at least in undergraduate courses)

Student evaluations of teaching (SETs) have become a widely used tool for assessing teaching in higher education. However, numerous investigations have shown that SETs are subject to multiple biases, one of which is particularly relevant, namely, the area of knowledge to which the subject belongs. This article aims to replicate the article by Uttl & Smibert (2017, https://doi.org/10.7717/peerj.3299) in a different educational context to verify whether the negative bias toward instructors who teach quantitative courses found by the authors in the US also appears in the Spanish university system. The study was conducted at the Business and Law School of the Universidad Pontificia Comillas, a private Spanish university, using two different samples. First, we analyzed undergraduate courses using a sample of 80,667 SETs in which 2,885 classes (defined as a single semester-long course taught by an individual instructor to a specific group of students), 488 instructors, and 322 different courses were evaluated over a period of four academic years (2016/2017–2019/2020). Second, for the same period, we analyzed 16,083 SETs corresponding to master’s degree courses, covering 871 classes, 275 instructors, and 155 different courses. All the data included in the analysis were obtained from official university surveys developed by a team of professionals specialized in teaching quality responsible for ensuring the reliability of the information. At the undergraduate level, despite the considerable cultural and temporal differences between the samples, the results are very similar to those obtained by Uttl & Smibert (2017, https://doi.org/10.7717/peerj.3299); i.e., professors teaching quantitative courses are far more likely to obtain worse SETs than instructors in other areas. There are hardly any differences at the master’s degree level, even though nearly 75% of master’s degree instructors also teach at the undergraduate level.
This leads us to three conclusions. (1) The evidence suggests that these differences do not arise because faculty teaching quantitative courses are less effective than faculty teaching in other fields: our results indicate that the same instructor is evaluated very differently depending on whether he or she teaches at the undergraduate or master’s level. (2) It is essential to avoid comparisons of SETs between different areas of knowledge, at least at the undergraduate level. (3) A significant change in the use and interpretation of SETs is imperative, or their replacement by other evaluation mechanisms should be considered. If this does not occur, it is possible that in the future, there will be an adverse selection effect among professors of quantitative methods; i.e., only the worst professionals in quantitative methods will opt for teaching since the good professionals will prefer other jobs.


INTRODUCTION
Student evaluations of teaching (SETs), even though they are widely used in universities around the world, continue to be surrounded by controversy as a result of the noninstructional biases that various studies have identified (see, for example, the recent review developed by Uttl, 2021). It is known that a large number of factors unrelated to teaching quality can bias the outcome of these evaluations. Some authors even go a step further, as in the case of Clayson (2022: 323), who proposes the likability hypothesis, suggesting that "the current evaluation system cannot validly measure anything other than what the students like and dislike".
To mention just a few of the biases identified in the literature, there is strong evidence that instructors who give better grades obtain better SETs than those who are more demanding, although the learning achieved by students is higher in the second group. In a recent article, Berezvai, Lukáts & Molontay (2021: 793) concluded that "increasing the grade of a student by one will cause them to give approximately 0.2-0.4 higher evaluations for the instructor in the SET survey". Indeed, several studies have reported that teachers who award higher grades receive better SETs than those who focus on the deeper learning that can be observed in later courses (Yunker & Yunker, 2003; Carrell & West, 2010; Kornell & Hausman, 2016). Braga, Paccagnella & Pellizzari (2014: 71) offered the following possible explanation: "teachers can either engage in real teaching or in teaching-to-the-test, the former requiring higher students' effort than the latter. Teaching-to-the-test guarantees high grades in the current course but does not improve future outcomes".
Related to the above, Felton, Mitchell & Stinson (2004) identified a student preference for course easiness using "Rate my professor" (RateMyProfessors.com) ratings. This result has been replicated using the same source by Felton et al. (2008), Rosen (2018), and Wallisch & Cachia (2019). Additionally, Arroyo-Barrigüete et al. (2021), using data from the Universidad Pontificia Comillas, concluded that instructors offering easy courses (low workload) tend to be rated highly. Uttl, Bell & Banks (2018) showed that class size also affects SETs, with a curvilinear relationship; i.e., SET ratings are the highest in the smallest classes, decline as the class size increases to 20-30 students, and then level off.
Many other biases have been identified, although it is true that in some cases, the evidence is contradictory or open to alternative interpretations. To give a few examples, Sanchez & Khan (2016) concluded that an instructor's accent in online education does cause learners to rate the instructor as less effective, although this accent did not seem to affect the learning outcomes. However, students did report greater comprehension difficulties with the accented instructor. Therefore, perhaps the bias detected is related to the additional effort needed, as comprehension scores decrease for nonnative (accented) instructors (Anderson-Hsieh & Koehler, 1988). This effect is also more noticeable in Sanchez & Khan's experiment because the narrator was never presented in any visual capacity to the participants; thus, important comprehension signals such as gesturing (Goldin-Meadow, Kim & Singer, 1999) were lost. Stonebraker & Stone (2015) found that instructor age negatively affects SETs, although the effect does not begin until the mid-forties. On the other hand, as the authors themselves indicate, the impact of age on SETs is small and can be offset by other factors, especially physical appearance (the effect of age disappears for professors rated as "hot") and how easy students consider them to be. The recent work by Tran & Do (2020) on a sample of 12,713 SETs of 124 courses concluded that age had no significant impact on SETs.
Finally, hundreds of studies have investigated gender bias, but their conclusions are conflicting. The gender effect could be an artifact of sample sizes, seniority, or field (see Uttl & Violo, 2021). As Uttl (2021: 246) indicates, "Gender differences could arise, be reduced, or even masked by a number of different factors". Therefore, it is not completely clear whether there is indeed a gender bias or whether this is an artifact of other covariates.
Furthermore, we could mention many other potential noninstructional biases identified in various research studies, demonstrating that SETs depend on multiple factors unrelated to professor teaching effectiveness. Clayson (2009: 16) concluded that there exists a small average relationship between learning and SETs, and "the more objectively learning is measured, the less likely it is to be related to the evaluations". The meta-analysis by Uttl, White & Gonzalez (2017) concluded that large sample-sized studies showed minimal correlation or even no correlation at all between SETs and learning, and consequently, SETs are not a valid measure of faculty teaching effectiveness. This conclusion is shared by Hornstein (2017) and Stroebe (2020), among others.
Nevertheless, the recent work by Uttl & Smibert (2017) highlights that a particularly relevant noninstructional bias, the subject area to which the course belongs, has likely been undervalued in many previous studies. Numerous articles have found a negative bias toward teachers who teach quantitative and/or STEM courses (Cashin, 1990; Beran & Violato, 2005; Centra, 2009; Uttl, White & Morin, 2013; Royal & Stockdale, 2015; DeFrain, 2016; Rosen, 2018; Arroyo-Barrigüete et al., 2021), but Uttl and Smibert claim that in several of these articles, the relevance of this bias is underestimated. They stated that parametric statistics are not appropriate due to several factors. First, the distributions of SETs are frequently negatively skewed due to ceiling effects. Second, to evaluate the effect of a variable on SETs, the method to be employed is the one that best reflects the effect of that variable on classifying professors as satisfactory/not satisfactory. Regardless of the percentage of the variance explained by that factor, the relevant metric is the risk of obtaining SETs below the cutoff value because this metric is the one that is frequently used to distinguish between good and bad performance. Therefore, the most appropriate effect size indices may be the relative risk ratio or odds ratios of professors passing the cutoff criteria instead of ds, rs, or R². Consequently, using ds, rs, or R² underestimates the size of the effect, which then seems less important than it is even though it can have a considerable impact on whether professors pass the minimum standard for satisfactory performance.
In a recent study by the authors of the present research (Arroyo-Barrigüete et al., 2021), it was found that in the case of a business and law school (ICADE), the most relevant noninstructional variable was precisely whether the course belonged to the area of quantitative methods. However, the relative risk ratios of this area were not calculated, nor were their implications explored in depth; the focus of that work was on identifying the most relevant factors from among a total of 31 noninstructional variables identified in the literature, with a particular focus on the differences between areas, but without prioritizing any one in particular. Starting from those results, that is, the notable negative bias identified toward quantitative courses, this article replicates the research of Uttl & Smibert (2017). There are three objectives. First, we test their findings in a different setting and period. The aforementioned article uses a sample from a midsized US university, whereas this article uses a sample from a midsized Spanish university. In addition, all the evaluations in the Uttl and Smibert study are prior to 2008, while the data used in this article correspond to 2016-2019. These two differences will allow us to assess whether the results of Uttl and Smibert can be extrapolated to other educational contexts and are also stable over time. Second, we intend to check whether these results are consistent when master's degree courses are also considered. Therefore, we have added a sample of SETs at this educational level. Finally, in this article, we compare quantitative courses against every other area without focusing specifically on the area of languages to verify whether the differences found by Uttl and Smibert can be extrapolated to other fields of knowledge.

MATERIALS AND METHODS
This article received ethical approval from the Ethics Committee of the Universidad Pontificia Comillas (approval number 2021/94). The research is based on the results obtained by Arroyo-Barrigüete et al. (2021), going deeper into one of the findings of that work: in the case of a business and law school, the area of knowledge to which the course belongs induces the greatest bias. In that article, it was confirmed that once we control for the GPA effect (professors who give better grades obtain better SETs), workload (instructors offering easy courses tend to be rated highly), and other noninstructional biases, there is still a significant effect related to the type of course (area of knowledge). Furthermore, as Uttl & Smibert (2017) state, in addition to the percentage of variance explained by this variable, the marked negative skewness of the distributions means that the probabilities of obtaining poor SETs are very different depending on the type of course. Thus, starting from the results obtained in both investigations, this work aims to replicate the paper by Uttl & Smibert (2017).
The study was conducted at the Universidad Pontificia Comillas, a private Spanish university founded in 1890. The university comprises seven different schools, has over 13,000 students and approximately 1,700 lecturers, and offers 43 different undergraduate degrees and 33 master's and doctoral programs. In this article, we have worked with the business and law school due to the aforementioned results obtained by Arroyo-Barrigüete et al. (2021). The sample for the undergraduate courses consists of 80,667 SETs and 2,885 classes in which 488 instructors and 322 different courses were evaluated over a period of four academic years (2016/2017-2019/2020). The sample for the master's courses consists of 16,083 SETs and 871 classes in which 275 instructors and 155 different courses were evaluated over the same period.
We used the class as the unit of analysis (a course taught by an instructor to a specific group of students). For each class, the average of the evaluations received by the instructor was calculated since it is the class average instead of individual student SETs that determines provosts' assessments of university faculty.
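As an illustration of this aggregation step, the following is a minimal Python sketch, assuming the raw survey data arrive as (class identifier, score) pairs. The function name `class_means`, the identifiers, and the scores are hypothetical and are not taken from the university surveys; the article's own analysis was carried out in R.

```python
from collections import defaultdict
from statistics import mean


def class_means(responses):
    """Average individual SET scores within each class (the unit of analysis)."""
    scores_by_class = defaultdict(list)
    for class_id, score in responses:
        scores_by_class[class_id].append(score)
    return {class_id: mean(scores) for class_id, scores in scores_by_class.items()}


# Hypothetical individual evaluations on the 1-10 scale (invented for illustration).
responses = [
    ("stats-A", 6), ("stats-A", 8),
    ("law-B", 9), ("law-B", 10), ("law-B", 8),
]

means = class_means(responses)
for class_id, value in means.items():
    print(class_id, value)
```

Each class then contributes a single value to the distributions compared later, regardless of how many students responded.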
The classification between quantitative and nonquantitative courses was made according to their content. All those with a strong mathematical, statistical or econometric content have been classified as quantitative. Therefore, an econometrics course has been classified as a quantitative course. Likewise, an experimental design course (with statistical content) and a financial mathematics course were also classified as quantitative. In the specific case of Universidad Pontificia Comillas, this classification is simple since all courses of this kind are taught by the Quantitative Methods Department.
A comparison of the density distributions for quantitative courses and the remaining courses was conducted using a k-sample Anderson-Darling test. We adopted a conservative alpha level of 0.005 to avoid false positives (Benjamin et al., 2018). Additionally, in all cases, we calculated the risk ratio, which is the ratio of the probability of an outcome (SETs below the cutoff value used by the university to distinguish good and bad teachers) in an exposed group (quantitative courses) to the probability of the same outcome in an unexposed group (nonquantitative courses). Mathematically, this is calculated by dividing the cumulative incidence in the exposed group by the cumulative incidence in the unexposed group.
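The risk ratio defined above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `risk_ratio` and the counts are hypothetical, and the 95% interval uses the standard log-scale (Katz) approximation, which is an assumption of this example and not necessarily the interval method used in the article's R analysis.

```python
import math


def risk_ratio(exposed_events, exposed_total, unexposed_events, unexposed_total, z=1.96):
    """Return (risk ratio, lower bound, upper bound).

    Events are classes whose mean SET falls below the cutoff.
    The interval is the log-scale (Katz) approximation, an assumption of this sketch.
    """
    risk_exposed = exposed_events / exposed_total
    risk_unexposed = unexposed_events / unexposed_total
    rr = risk_exposed / risk_unexposed
    # Standard error of ln(RR) for two independent binomial samples.
    se_log = math.sqrt(
        1 / exposed_events - 1 / exposed_total
        + 1 / unexposed_events - 1 / unexposed_total
    )
    lower = math.exp(math.log(rr) - z * se_log)
    upper = math.exp(math.log(rr) + z * se_log)
    return rr, lower, upper


# Hypothetical counts: 60 of 100 quantitative classes and 30 of 100
# nonquantitative classes fall below the cutoff.
rr, lower, upper = risk_ratio(60, 100, 30, 100)
print(f"risk ratio {rr:.2f}, 95% interval {lower:.2f} to {upper:.2f}")
```

A ratio above 1 whose interval excludes 1 indicates that the exposed group (quantitative courses) is more likely to fall below the cutoff.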
All the data included in this analysis were obtained from official university surveys developed by a team of experts in teaching quality. Table 1 shows the main statistics of the sample.
Although the difference is minimal, it should be noted that the sum of professors by area does not exactly coincide with the number of professors included in the sample (275 for master's courses and 488 for undergraduate courses), since a few instructors teach in several categories (typically in courses in their area of knowledge and in the area of "general contents").
The analyses were conducted in R (R Core Team, 2020), a software environment for statistical computing and graphics, using several packages to elaborate the code: gridExtra (Auguie, 2017), readxl (Wickham & Bryan, 2019), fmsb (Nakazawa, 2019), KSamples (Scholz & Zhu, 2019), RcmdrMisc (Fox, 2020), dplyr (Wickham et al., 2020) and ggplot2 (Wickham, 2016).

RESULTS
Figure 1 shows the smoothed density distribution of the mean ratings for quantitative courses and for nonquantitative courses (undergraduate degree programs) on a scale from 1 to 10. The figure highlights that the distributions are considerably different between quantitative and nonquantitative courses, with the average values (vertical lines) differing by about one point, i.e., 7.14 for quantitative courses and 8.13 for nonquantitative courses. Following Uttl & Smibert (2017), the figure was generated using the R function "density" with a smoothing kernel set to "Gaussian." The p-value obtained in the k-sample Anderson-Darling test is very low (1.88E−37), so we can state that the two distributions are different.
Figure 2 shows the same comparison for the master's programs only, and we can observe a very different scenario, namely, both distributions and average values are quite similar (8.43 for quantitative courses and 8.34 for nonquantitative courses). The p-value obtained in the k-sample Anderson-Darling test is relatively high (0.099), so we cannot state that the two distributions are different.
Figure 3 shows the density distributions for quantitative courses and the remaining courses at the undergraduate level (languages, general contents, law, management, finance, economics, marketing, and international relations). The vertical lines indicate the average values. There are several conclusions related to this figure. (1) Regardless of the area, the instructors of quantitative courses obtain worse evaluations than those of other areas. In some cases, the differences are considerable, as in the case of languages and law, and in others, they are not as pronounced, as in the case of marketing. The k-sample Anderson-Darling test shows statistically significant differences for all comparisons (see Table 2). (2) In the remaining areas, the distributions of the ratings are negatively skewed, and the only area in which this does not happen is precisely quantitative methods. (3) Combining both effects, we can conclude that regardless of the cutoff criterion used by the university to distinguish good and bad teachers, instructors in the area of quantitative methods will always be disadvantaged in comparative terms.
Figure 4 shows the same comparison for the master's programs only, and again, we can observe a very different scenario. (1) The instructors of quantitative courses obtain similar or even better evaluations than instructors of other areas. The k-sample Anderson-Darling test shows no statistically significant differences for any comparison (see Table 2). (2) Evaluations in quantitative methods present negative asymmetry, and (3) both factors lead to instructors in this area not being disadvantaged by the application of equal cutoff criteria for all areas.
However, regardless of the shape of the distributions, universities typically use a certain cutoff value to distinguish professors with good or bad performance. In the Spanish system, one of the usual forms of calculating SETs is on a scale of 1 to 10. It is an easy-to-understand scale that replicates the one used to evaluate students. The cutoff value is usually approximately 7 or 8; teachers with SETs below this value should improve their teaching. Therefore, in Fig. 5, we calculated the relative risk ratio for different cutoff values in the undergraduate programs using an interval from 7 to 9, covering all possible values within reason. The relative risk ratios were calculated in all cases compared to quantitative methods courses. We also calculated the percentage of instructors who would fall below each cutoff value. The numerical data can be found in Table 3. The results confirm that regardless of the cutoff value selected, quantitative methods instructors are always disadvantaged, dramatically in some cases (as is the case when compared with languages). Moreover, the problem is not solved by lowering the cutoff value since, due to the shape of the distributions, such a measure will decrease the percentage of professors below the threshold value but will increase the relative risk ratio with the remaining courses.
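The cutoff sweep described here can be sketched as follows. The class means are invented purely for illustration (they are not the data behind Fig. 5 or Table 3), and the function names `share_below` and `sweep` are hypothetical; the sketch only shows how the share of classes below a cutoff and the resulting risk ratio can move in opposite directions.

```python
def share_below(class_means, cutoff):
    """Fraction of classes whose mean SET falls strictly below the cutoff."""
    return sum(score < cutoff for score in class_means) / len(class_means)


def sweep(exposed, unexposed, cutoffs):
    """Return one (cutoff, exposed share, unexposed share, risk ratio) row per cutoff."""
    rows = []
    for cutoff in cutoffs:
        p_exposed = share_below(exposed, cutoff)
        p_unexposed = share_below(unexposed, cutoff)
        # The ratio is undefined when no unexposed class falls below the cutoff.
        ratio = p_exposed / p_unexposed if p_unexposed > 0 else float("nan")
        rows.append((cutoff, p_exposed, p_unexposed, ratio))
    return rows


# Hypothetical class means on the 1-10 scale, invented for illustration only.
quantitative = [6.2, 6.6, 6.9, 7.4, 7.9, 8.3, 8.8, 9.1]
other_area = [6.9, 7.8, 8.1, 8.4, 8.6, 8.9, 9.2, 9.5]

rows = sweep(quantitative, other_area, cutoffs=[7, 8, 9])
for cutoff, p_q, p_o, ratio in rows:
    print(f"cutoff {cutoff}: {p_q:.1%} vs {p_o:.1%} below, risk ratio {ratio:.2f}")
```

In this invented example, lowering the cutoff from 9 to 7 reduces the share of classes below the threshold in both groups while the risk ratio rises, which is the pattern the text attributes to the shape of the distributions.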
Focusing on a specific case, for a cutoff value of 8, we find that 72.6% of the professors of quantitative methods would be included in the "poor performance" category. In comparison, the percentages are substantially lower in the remaining areas: 28.3% in languages, 34.0% in law, 37.6% in general contents, 38.4% in finance, 40.4% in international relations, 42.7% in management, 45.8% in economics and 56.3% in marketing. The relative risk for quantitative methods instructors is therefore much higher, with the resulting consequences for career development.
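As a worked example, the relative risk implied by these percentages can be obtained by dividing the quantitative methods figure by the figure for each area. The sketch below uses only the rounded percentages quoted in the text for a cutoff of 8, so the ratios are approximate and may differ slightly from values computed on the underlying counts; the variable names are illustrative.

```python
# Percentage of classes below a cutoff of 8, as quoted in the text.
quantitative_pct = 72.6
other_areas_pct = {
    "languages": 28.3,
    "law": 34.0,
    "general contents": 37.6,
    "finance": 38.4,
    "international relations": 40.4,
    "management": 42.7,
    "economics": 45.8,
    "marketing": 56.3,
}

# Relative risk of quantitative methods versus each area.
relative_risk = {
    area: quantitative_pct / pct for area, pct in other_areas_pct.items()
}

for area, rr in sorted(relative_risk.items(), key=lambda item: item[1], reverse=True):
    print(f"{area:>24}: {rr:.2f}")
```

Under these rounded figures, a quantitative methods class is roughly 2.6 times as likely as a languages class to fall below the cutoff, and about 1.3 times as likely as a marketing class.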
In the case of the master's programs, again, the situation is different. The same analysis shows that most relative risk ratios are not significant. They are significant only for some cutoff values, and these values differ depending on the area (see Table 4). Therefore, although there are statistically significant differences for some values, the differences sometimes favor and sometimes disfavor the quantitative methods courses, with no consistent pattern. The conclusion, therefore, is that in master's programs, quantitative methods instructors are neither disadvantaged nor favored with respect to other areas.

DISCUSSION
First, it is very interesting to confirm that the results obtained by Uttl & Smibert (2017) for the comparison between math and English undergraduate courses are very similar to those obtained in this article. Table 5 uses the same cutoff criteria as those authors for comparative purposes. The similarities are remarkable, which is of particular interest because they come from completely different educational contexts (USA vs. Spain) and very different time periods (SETs prior to 2008 vs. SETs from 2016-2019). Therefore, the first conclusion is that the results obtained by Uttl & Smibert (2017) are remarkably robust even when the sample considered is substantially modified.
In this sense, it has been confirmed that in undergraduate courses, the area of knowledge induces a considerable bias in SETs (Cashin, 1990; Beran & Violato, 2005; Centra, 2009; Uttl, White & Morin, 2013; Royal & Stockdale, 2015; Uttl & Smibert, 2017; Arroyo-Barrigüete et al., 2021). Furthermore, it is also clear that this bias has relevance, regardless of the percentage of the variance explained by that factor. Instructors of quantitative methods have a much higher risk of obtaining SETs below the cutoff value than instructors of other areas. Moreover, this conclusion is the same for any cutoff value and for any other area. Simply put, these professors suffer a penalty for the mere fact that they are teaching quantitative methods. Of course, the differences are more significant for some areas (languages or law) than for others (economics or marketing). However, in all cases, regardless of the cutoff value chosen by the university to differentiate between good and bad performance, quantitative methods instructors will be at much greater risk of being included in the "poor performance" category.
It could be argued that, because a crude analysis has been conducted without controlling for the numerous confounding factors that affect SETs, the results may be distorted. Moreover, as some authors have suggested, perhaps the problem is precisely that teachers of quantitative courses are worse than teachers of other areas, so their evaluations do not respond to noninstructional biases but to this fact. Cashin (1990: 119) suggests as a possible explanation that "some academic fields are poorly taught. Many of the low-rated fields are those in which institutions must pay very high salaries even to compete modestly with business and industry. Perhaps the faculty teaching those courses are less effective as a group than faculty in some other fields. It costs far less to hire an outstanding teacher in English than it does to hire an outstanding teacher in computer science, accounting, or engineering". Our analysis suggests that this is not the reason. To explore this possibility, we selected only those professors who simultaneously teach bachelor's and master's degree courses, focusing on quantitative courses. Table 6 shows paired data for each instructor. As can be seen, in all cases and without exception, SETs improve in master's courses, with increases ranging from 3.6% to 57.7%. In addition, the percentage of courses with a SET below the cutoff value of 8 also varies significantly in almost all cases. Figure 6 shows the smoothed density distribution of the mean ratings for quantitative courses at the undergraduate level (112 classes) and for the master's level (65 classes), considering only those instructors. The figure was generated using the R function "density" with a smoothing kernel set to "Gaussian". The figure highlights that the distributions are considerably different (the p-value obtained in the k-sample Anderson-Darling test is 1.6E−15), with the average values (vertical lines) differing by 1.4 points. The discrepancies are considerable even though we are analyzing the same group of instructors in both cases. Certainly, given that class size in master's courses is significantly smaller, it is possible that part of the difference is due to this bias (see Uttl, Bell & Banks, 2018). However, the differences are too large for that to be the explanatory factor, especially when areas such as "management", "general contents," and "law" do not see the same effect. In other words, the same professor is evaluated differently depending on whether he or she teaches undergraduate or master's courses.
This evidence suggests that the marked negative bias toward quantitative courses at the undergraduate level is not due to the deficits of the instructors themselves, since such deficits would also be seen at the master's level, which is not the case. However, other possible explanations need to be considered. First, it is possible that instructors are more motivated or put more effort into master's teaching than into undergraduate teaching. In the specific case of Universidad Pontificia Comillas, this is unlikely to be the case. On the one hand, let us consider the instructor who wants to teach to SETs (to get the highest SETs possible). For this instructor profile, it makes no sense to differentiate between undergraduate and master's courses because the evaluation of instructors does not distinguish between the two levels; an instructor who aspires to be promoted must obtain good SETs in both undergraduate and master's courses. Focusing efforts exclusively on master's courses would lead to a negative evaluation in undergraduate courses and, consequently, a loss of promotion. Perhaps there is a little extra motivation for part-time teachers, as they are teaching potential future colleagues. However, even in these cases, if their evaluations in undergraduate courses are low, they will probably be dismissed. On the other hand, let us consider the instructor who enjoys teaching and for whom SETs are not a priority. Most likely, this instructor will enjoy teaching students who are there to learn and may invest more effort teaching master's-level students who are more likely to be interested in the subject. We cannot rule out this possibility, but there are some indications that it is not a common occurrence. The first is that the improvement in master's courses does not occur in all subjects, and in "management", "general contents," and "law," the SETs are virtually identical.
The second is that if this type of behavior is detected at the university, the instructor will receive a warning for not paying enough attention to the undergraduate courses. At Universidad Pontificia Comillas, undergraduate programs are one of the cornerstones of its educational strategy, with a much higher volume of students than in master's or PhD programs. They are therefore a priority, and instructors would not be allowed to neglect them to devote more effort to master's courses.
A second possibility is that instructors opt to use their better teaching assistants for the master's courses. However, this is not possible at the Universidad Pontificia Comillas since there are no teaching assistants. In fact, this role is not widespread in the Spanish university system. Therefore, the most plausible remaining explanation is that the student population is very different at the master's level than at the undergraduate level. Students who choose a master's degree with quantitative content are genuinely interested in it, as there are many other alternatives available that are free of such content. In contrast, undergraduate business administration students are forced to take quantitative courses. Therefore, the question that arises is whether the effect detected in quantitative courses is a problem of mandatory vs. elective classes. However, in the Spanish university system, roughly 80-90% of all undergraduate courses are mandatory. In the sample of undergraduate courses considered in this article, 223 classes (7.7%) correspond to elective courses, and 2,662 classes (92.3%) correspond to mandatory courses. Yet the enormous negative bias toward quantitative courses has not been detected in other disciplines where almost all courses are also mandatory, so mandatory status alone cannot account for it. In our opinion, a possible explanation for this phenomenon can be found in motivational factors; as mentioned, the student population is very different at the master's level than at the undergraduate level. A student who wishes to pursue a degree in business administration is more likely to be interested in finance or management courses than in mathematics or statistics. These quantitative courses are compulsory to obtain a diploma, so the student must study them, but their approach to the course will probably not be the most positive.
This effect was detected by Uttl, White & Morin (2013) among undergraduate psychology students; out of 340 participants, fewer than 10 were very interested in taking a quantitative course, but 159 were very interested in taking the Introduction to the Psychology of Abnormal Behavior. In fact, the mean interest in statistics courses was nearly 6 standard deviations below the mean interest in nonquantitative courses. In other words, the Spanish university system forces undergraduate business administration students to take quantitative courses that are probably very unattractive to them. In addition, there is no possibility of avoiding these courses. However, master's degrees allow a high level of specialization so that it is possible to choose the path that best suits each student's interests. Students who choose a master's degree with quantitative content are truly interested in it, as there are many other alternatives available that are free of such content. Hence, their approach to the courses, and therefore the SETs, are more positive.
We must be aware that the mathematical level in Spain is low compared to that in other countries. The latest PISA report, which measures the mathematical literacy of 15-year-old students, indicates that the mathematical level in Spain is below the OECD average (OECD, 2018). This leads to a clear disinterest in this subject. A recent study in Spain (Pedrosa, 2020) on a sample of 1,293 students with various degrees (food and agricultural engineering, biology, food science and technology, childhood education, computer engineering, primary education and tourism) concluded that "Students do not like mathematics, do not enjoy using it, do not enjoy talking about it, and do not feel motivated to study it, so they would not take mathematics courses voluntarily, nor would they want a job in which they would have to use it" (pp. 174-175). Interest in taking courses has an important effect on SETs, as has been shown in several studies (Hoyt & Lee, 2002; Rampichini, Grilli & Petrucci, 2004; Wolbring, 2012; La Rocca et al., 2017; Sulis, Porcu & Capursi, 2019).
The perceived usefulness of the course could also play a relevant role; it is quite possible that master's students perceive quantitative courses as fundamental for career development. Undergraduate students who also tend to take these courses in their first years and are therefore still far from the labor market find it more difficult to perceive their usefulness in terms of professional development. Finally, we can hypothesize a possible "gratification effect"; i.e., perhaps quantitative courses do not provide gratification to business administration students due to several factors. First, these courses are perceived as a continuation of high school, so the "discovery factor" is less than that in other completely new subjects. Second, quantitative courses do not usually provide immediate solutions to real problems but at best provide a mediated solution, something that students in the first years are likely to perceive negatively. Nevertheless, it is necessary to highlight that the "gratification effect" is a purely speculative hypothesis, as it is possible that confounding factors account for part (or much) of this effect.
Regardless of the causes, from our point of view, the results obtained make it necessary to reconsider the use of SETs in many Spanish universities, where it is relatively common to compare evaluations in all areas.
The main limitation of this article is that we have considered only one university and, within it, only the business and law school. For this reason, we propose a future line of research that replicates this work with a more diverse sample. However, since the results are very similar to those of Uttl & Smibert (2017), even with the considerable differences in the samples used, we believe that very similar results will most likely be obtained. Therefore, in our opinion, the findings are considerably robust. A second limitation is that we have not controlled for perceived difficulty, required workload, or other variables that may act as confounding factors and could potentially account for the differential pattern between undergraduate and master's courses. However, this limitation does not invalidate the main conclusion of this article; regardless of the causes (higher perceived difficulty, higher required workload, or simply that students do not like mathematics), it is clear that in undergraduate courses, there exists a considerable negative bias in SETs, which is detrimental to quantitative courses.

CONCLUSIONS
Throughout this article, we analyzed SETs at a midsize Spanish university from 2016/2017–2019/2020. All the data included in this analysis were obtained from official university surveys developed by a team of experts in teaching quality. The sample for the undergraduate courses consists of 80,667 SETs and 2,885 classes in which 488 instructors and 322 different courses were evaluated. The sample for the master's courses consists of 16,083 SETs, 871 classes, 275 instructors, and 155 different courses. The results show that in the case of undergraduate courses, there is a considerable difference between courses that use quantitative methods and other areas. The relative risk rates, regardless of the cutoff criterion used to distinguish between "good performance" and "bad performance", are clearly unfavorable toward instructors of quantitative courses, who are much more likely to be classified as "bad teachers". While the differences depend on the area against which the comparison is made, in all cases, the comparison is unfavorable. Most interesting is the consistency with Uttl & Smibert's (2017) analysis, as strikingly similar results have been obtained even with considerable differences between the samples used. However, in the case of master's courses, the situation is entirely different, and no significant differences between quantitative courses and the remaining courses are apparent, except for some cutoff criteria.
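As a rough illustration of how such a relative risk rate can be computed, the sketch below takes the ratio of the shares of instructors classified as "bad performance" in two areas. It uses the two Table 5 percentages quoted later in these conclusions (74.9% for quantitative methods and 30.9% for languages, with the cutoff set at the average SET in all subjects). The function name `relative_risk` is ours, and the simple ratio-of-proportions definition is an assumption about how the rates were obtained, not a reproduction of the original analysis.

```python
def relative_risk(p_group: float, p_reference: float) -> float:
    """Ratio of the 'bad performance' share in one group to that in a reference group.

    Both arguments are proportions in [0, 1]; the reference share must be nonzero.
    This is the textbook ratio-of-proportions definition (an assumption here).
    """
    if not (0.0 <= p_group <= 1.0) or not (0.0 < p_reference <= 1.0):
        raise ValueError("proportions must lie in [0, 1], with a nonzero reference")
    return p_group / p_reference


# Shares quoted in the text from Table 5 (cutoff = average SET in all subjects).
p_quantitative = 0.749  # quantitative methods instructors below the cutoff
p_languages = 0.309     # language instructors below the cutoff

rr = relative_risk(p_quantitative, p_languages)
print(f"Relative risk, quantitative vs. languages: {rr:.2f}")
```

With these two inputs the ratio is roughly 2.4; i.e., under this cutoff, a quantitative methods instructor would be about 2.4 times as likely as a language instructor to fall into the "bad performance" category.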
It seems clear that quantitative courses are not appreciated by students, at least in undergraduate courses in business and law school. This is probably due to a problem that goes back to high school. In this sense, perhaps more effort should be made in preuniversity studies to make this type of course more attractive by giving the courses greater proximity to real problems and using a more practical approach.
These results lead to three different conclusions. First, the hypothesis that the marked negative bias toward quantitative courses in undergraduate courses is due to the instructors' deficits seems to be unjustified. Our results suggest that the same professor is evaluated very differently depending on whether he or she teaches undergraduate or master's degree courses. Second, it is essential to avoid comparing SETs between different areas of knowledge, at least at the undergraduate level. The results clearly show a strong negative bias toward professors who teach quantitative courses, so any comparison with the SETs obtained by professors in other areas of knowledge is clearly unfair. The most obvious example is found in Table 5. If we use a cutoff criterion equal to the average SET in all subjects, 74.9% of quantitative methods instructors will be included in the "bad performance" category. However, only 30.9% of language instructors fall into this category. This leads us to the third conclusion, namely, if there is no significant change in the use and interpretation of SETs, an adverse selection process may occur. Cashin (1990) argued that many of the low-rated fields require institutions to pay very high salaries to compete with business and industry. In other words, only the worst professionals opted for teaching since the good professionals had better alternatives in the industry, which meant that teachers were worse in these areas than in other areas. Our results suggest that this is not the case. However, this is a substantial risk for the future. Currently, a specialist in quantitative methods has more economically attractive alternatives in the industry than in universities, at least in the case of Spain. After 5 years of doctoral studies, the best option in a university is a position as an "Ayudante Doctor" (the lowest level in the Spanish system), which implies a salary of between 25,000 and 28,000 euros per year, depending on the university. 
This salary is practically the same as what a graduate in science without any additional studies can obtain in the corporate world immediately after finishing his or her degree. Furthermore, the salary is undoubtedly much lower than the salary that he or she will obtain after 5 years. Therefore, if he or she chooses to develop his or her professional career in academia, it is because of a strong vocational calling. However, if he or she finds himself/herself in an environment in which he or she obtains systematically lower SETs than his or her peers in other areas, which results in lower possibilities of receiving tenure, promotion, or merit pay, he or she may reconsider the decision. This would lead to Cashin's statement becoming true; in the long run, it is possible that only professionals who are not able to obtain a better job in the industry will opt for academia. If there are no changes in the use and interpretation of SETs, in a few years we may see that the differences between the SETs received by instructors of quantitative methods and those of other areas are even greater. Then, there will be an objective reason for this.
From a more general perspective, concerning the validity of the SETs, this work proves that the area of knowledge generates a considerable bias in SETs. Therefore, the practice of comparing SETs between different areas of knowledge should be discontinued immediately. One possibility to correct this bias is to establish a within-field percentile of the rating average. The cutoff criterion to distinguish good and bad teachers would be the percentile determined by the university. In this way, the comparison would be carried out exclusively between instructors in the same area. It would also be possible to calculate the within-field z-score for each professor; the number of standard deviations a given instructor lies above or below the mean would in this case be the indicator to use. Unlike what happens with other noninstructional biases such as accent bias or gender bias, in this case, the solution seems feasible and depends only on a political decision, that is, on the way SETs are used. However, another problem arises; countless previous studies have shown that SETs depend on multiple factors unrelated to professors' effectiveness. Therefore, to make SETs reliable instruments, it would be necessary to correct these biases as well. This is, for example, the proposal by Berezvai, Lukáts & Molontay (2021: 806) to correct the bias derived from GPA (teachers who award higher grades get better SETs): "A potential way to obtain an unbiased SET score would be to subtract 0.3 or 0.4 multiplied by the difference in course grade average compared to the average grade of all students in the entire faculty". This approach would lead to the need to correct all biases similarly. Then, a third problem emerges; given the results of previous research, it seems that both noninstructional biases and their magnitude may vary significantly even from one class to another, depending on the characteristics of the learners. 
Thus, the correction rules would have to be specific to virtually every class, which is not possible. Consequently, we are skeptical about the possibility of correcting the results of SETs to make them valid instruments. We think it is possible to mitigate noninstructional biases, and a good start would be to eliminate comparisons between different areas of knowledge, but it does not seem possible to eliminate biases completely. SETs can perhaps be used to identify extreme cases (very high or very low performers, with similar behavior over a long period). However, even in these cases, the results should be taken with caution, aware that there is no certainty that these results are indicators of better teaching effectiveness but that they may be due to the sum of multiple unrelated factors. In conclusion, alternative mechanisms for faculty evaluation should be sought. Moreover, those universities that, despite all the problems indicated above, decide to continue using SETs should at least use them in a very different way.
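The within-field percentile and within-field z-score proposed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of any particular university: the function name `within_field_scores`, the tuple layout `(instructor, field, mean SET)`, and the example ratings are all hypothetical, and the percentile is taken here as the share of same-field instructors scoring at or below the instructor in question.

```python
from collections import defaultdict
from statistics import mean, pstdev


def within_field_scores(ratings):
    """Compare each instructor only with peers in the same field.

    ratings: list of (instructor, field, mean_set) tuples (hypothetical layout).
    Returns {(instructor, field): (z_score, percentile)}.
    """
    # Group the mean SETs by area of knowledge.
    by_field = defaultdict(list)
    for _, field, score in ratings:
        by_field[field].append(score)

    result = {}
    for instructor, field, score in ratings:
        peers = by_field[field]
        mu = mean(peers)
        sigma = pstdev(peers)  # population SD of the field
        # z-score: SDs above or below the field mean (0 if the field has no spread).
        z = (score - mu) / sigma if sigma > 0 else 0.0
        # Percentile rank: share of same-field scores at or below this one.
        pct = 100.0 * sum(s <= score for s in peers) / len(peers)
        result[(instructor, field)] = (z, pct)
    return result


# Invented ratings, for illustration only (not data from this study).
example = [
    ("A", "quantitative", 6.0), ("B", "quantitative", 7.0), ("C", "quantitative", 8.0),
    ("D", "languages", 8.0), ("E", "languages", 9.0), ("F", "languages", 10.0),
]
scores = within_field_scores(example)
for key, (z, pct) in scores.items():
    print(key, f"z = {z:+.2f}", f"percentile = {pct:.0f}")
```

In this invented example, instructors C and D receive the same raw rating of 8.0, yet C lies above the mean of the quantitative area while D lies below the mean of the languages area. A cutoff applied to the within-field indicator would therefore treat them differently from a cutoff applied to the pooled average.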